Skip to content

kv/lease: adopt CLOCK_MONOTONIC_RAW for lease-read path (#551)#604

Merged
bootjp merged 7 commits intomainfrom
feat/monoclock
Apr 23, 2026
Merged

kv/lease: adopt CLOCK_MONOTONIC_RAW for lease-read path (#551)#604
bootjp merged 7 commits intomainfrom
feat/monoclock

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Apr 23, 2026

Go's time.Now() is backed by CLOCK_MONOTONIC, which POSIX lets NTP slew at up to 500 ppm. At the current 300 ms leaseSafetyMargin the worst-case slew error over a 700 ms lease window is ~0.35 ms — three orders of magnitude inside the margin — but lease safety should not rest on NTP being well-behaved. TiKV uses CLOCK_MONOTONIC_RAW for the same reason; this change adopts it in elastickv.

New internal/monoclock package wraps clock_gettime(CLOCK_MONOTONIC_RAW) via x/sys/unix on Linux/Darwin/FreeBSD, with a runtime-monotonic fallback on other platforms. leaseState, quorumAckTracker, the etcd engine's single-node ack, LeaseProvider.LastQuorumAck, and all coordinator LeaseRead / Dispatch / refreshLeaseAfterDispatch sites sample monoclock.Now() instead of time.Now(). Tests updated to match; quorum-ack tracker pins the raw-monotonic frame with a regression test. docs/lease_read_design.md §3.1-3.2 expanded with the rationale.

Summary by CodeRabbit

  • Documentation

    • Updated lease design documentation to reflect timing mechanism updates.
  • Internal Improvements

    • Transitioned lease timing to use monotonic clock measurements instead of wall-clock time for improved reliability.
    • Introduced monotonic clock abstraction for intra-process timing operations.
  • Testing

    • Added comprehensive test coverage for monotonic clock semantics and lease timing behavior.

Go's time.Now() is backed by CLOCK_MONOTONIC, which POSIX lets NTP
slew at up to 500 ppm. At the current 300 ms leaseSafetyMargin the
worst-case slew error over a 700 ms lease window is ~0.35 ms — three
orders of magnitude inside the margin — but lease safety should not
rest on NTP being well-behaved. TiKV uses CLOCK_MONOTONIC_RAW for the
same reason; this change adopts it in elastickv.

New internal/monoclock package wraps clock_gettime(CLOCK_MONOTONIC_RAW)
via x/sys/unix on Linux/Darwin/FreeBSD, with a runtime-monotonic
fallback on other platforms. leaseState, quorumAckTracker, the etcd
engine's single-node ack, LeaseProvider.LastQuorumAck, and all
coordinator LeaseRead / Dispatch / refreshLeaseAfterDispatch sites
sample monoclock.Now() instead of time.Now(). Tests updated to match;
quorum-ack tracker pins the raw-monotonic frame with a regression
test. docs/lease_read_design.md §3.1-3.2 expanded with the rationale.
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 23, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 47 minutes and 37 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 47 minutes and 37 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 74ad9d41-dd38-47b7-b4ab-1fe1e6f371e8

📥 Commits

Reviewing files that changed from the base of the PR and between 587717e and c5c27b7.

📒 Files selected for processing (1)
  • internal/raftengine/engine.go
📝 Walkthrough

Walkthrough

This PR introduces a new monoclock package providing an intra-process monotonic-raw clock abstraction, then systematically replaces wall-clock time.Time usage with monoclock.Instant throughout the lease system. It updates the lease state implementation to use atomic nanosecond storage, modifies the raft engine and quorum tracker interfaces, and adjusts all coordinator and lease logic to consume/produce monotonic instants instead of wall-clock time.

Changes

Cohort / File(s) Summary
Monoclock Package
internal/monoclock/monoclock.go, internal/monoclock/monoclock_unix.go, internal/monoclock/monoclock_fallback.go, internal/monoclock/monoclock_test.go
New package providing Instant type wrapping raw nanosecond counters with platform-specific implementations (Linux/Darwin using CLOCK_MONOTONIC_RAW via golang.org/x/sys/unix, fallback using time.Since). Exposes helpers for comparisons, arithmetic, and conversions.
Design Documentation
docs/design/2026_04_20_implemented_lease_read.md
Updated lease design to replace expiry (*time.Time) with expiryNanos (int64), justify monoclock.Now() over time.Now() by addressing NTP slew hazards, and reattribute safety margin to inter-host monotonic drift (≤50 ppm) rather than wall-clock skew.
Raft Engine Interfaces & Quorum Tracking
internal/raftengine/engine.go, internal/raftengine/etcd/engine.go, internal/raftengine/etcd/quorum_ack.go, internal/raftengine/etcd/quorum_ack_test.go
LeaseProvider.LastQuorumAck() now returns monoclock.Instant instead of time.Time. Quorum tracker replaces quorumAckUnixNano with quorumAckMonoNs (atomic nanoseconds). Single-node fast-path uses singleNodeLeaderAckMonoNs instead of wall-clock nanoseconds. Tests updated to assert zero using monoclock.Zero and verify monotonic-raw timestamps.
Lease State Implementation
kv/lease_state.go, kv/lease_state_test.go
Core refactor: replaces atomic *time.Time with atomic pointer to immutable leaseSlot (containing expiryNanos + gen). valid() consumes monoclock.Instant with fail-closed handling for zero. extend() uses pointer-identity CAS with rollback protection. invalidate() increments generation and installs zero-expiry slot. New tests verify CAS rollback behavior and clock-failure handling.
Coordinator Lease Logic
kv/coordinator.go, kv/lease_read_test.go, kv/raft_engine.go, kv/sharded_coordinator.go, kv/sharded_lease_test.go
Dispatch and LeaseRead sample from monoclock.Now() instead of time.Now(). engineLeaseAckValid compares monotonic instants and rejects fast-path if now is zero (clock failure). Test helpers updated to use monoclock.Instant and sample from monotonic clock.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

The lease state refactor (pointer-identity CAS rollback logic with atomic slot swaps) and corresponding test coverage are dense and require careful verification. Most other changes follow a consistent pattern replacement, but the concentration of complex atomic logic in lease_state.go and the interplay across raft engine, coordinator, and lease-state boundaries demand sustained reasoning across multiple files.

Possibly related issues

Possibly related PRs

Poem

🐰 A hop through monotonic time,
No wall-clock stumbles, no NTP chimes,
Instants atomic, raw and true,
Leases now race-safe and new! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 44.23% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and accurately describes the main change: adopting CLOCK_MONOTONIC_RAW for the lease-read path through a new monoclock package, with specific file references and a PR number for traceability.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/monoclock

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5b00411fb5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/monoclock/monoclock_unix.go Outdated
bootjp added 2 commits April 24, 2026 03:21
golang.org/x/sys/unix does not export CLOCK_MONOTONIC_RAW on FreeBSD
(its kernel exposes CLOCK_MONOTONIC_PRECISE, a different clock), so
the previous `linux || darwin || freebsd` tag broke
GOOS=freebsd cross-compiles with "undefined: unix.CLOCK_MONOTONIC_RAW".
FreeBSD now falls into the runtime-monotonic fallback alongside
Windows / Plan 9; lease safety on FreeBSD reverts to the NTP-slewed
baseline, still well inside the 300 ms safety margin.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 23, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 551ceaad3e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +21 to +22
if err := unix.ClockGettime(unix.CLOCK_MONOTONIC_RAW, &ts); err != nil {
return 0
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid returning zero when monotonic clock read fails

Returning 0 on ClockGettime error turns monoclock.Now() into the minimum possible instant, and that does not fail closed in the lease path: Coordinate.LeaseRead falls back to the caller-side check c.lease.valid(now) (kv/coordinator.go) where now=0 is treated as "before expiry" for any non-zero lease, so once a lease is warmed the node can keep serving local reads without LinearizableRead under persistent clock failures (for example, sandbox/seccomp environments where clock_gettime is denied). This can bypass lease expiration safety during leader isolation; the error path should invalidate/disable lease fast-paths instead of producing a valid timestamp.

Useful? React with 👍 / 👎.

When unix.ClockGettime errors (realistically: seccomp / sandbox denies
it), monoclock.Now() returns the zero Instant. Without additional
guards, leaseState.valid(zero) compared the zero against any warmed
expiry and returned true (zero.Before(positive) holds), so a node
whose clock read persistently failed could keep serving stale reads
off a once-warmed lease -- bypassing LinearizableRead even during
leader isolation.

Teach leaseState.valid and engineLeaseAckValid to reject a zero now
explicitly; the lease fast paths fall through to LinearizableRead,
which fails closed on a truly broken clock and succeeds as soon as
the clock starts returning real readings again.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 23, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 03edaa1626

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread kv/lease_state.go Outdated
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 23, 2026

@codex review

The previous leaseState stored expiry as atomic.Int64 and rolled back a
raced extend via expiryNanos.CompareAndSwap(target, 0). When two
extenders computed the same target (clock-granularity tie across a
leader-loss invalidate), the stale extender's value-gated rollback
could clobber the fresh extender's still-valid lease, breaking the
generation guard's intent.

Replace the int64+gen pair with atomic.Pointer[leaseSlot]. Each
successful extend installs a freshly-allocated *leaseSlot, so pointer
identity alone disambiguates co-targeted extenders: a rollback CAS
against the extender's own *leaseSlot cannot match a pointer a
concurrent winner has already installed, even when the expiry values
are identical. Same monotonic generation-counter semantics preserved
inside the slot.

Add a regression test (TestLeaseState_RollbackCASUsesPointerIdentity)
that deterministically simulates the clock-tie race by manipulating
internal state: a pointer-gated CAS fails while a value-gated CAS
would have erased the fresh lease. Positive-control test pins the
in-place clearing path so we do not silently stop rolling back when
we should.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 23, 2026

Codex review items — addressed

  • Item 1 (P1, FreeBSD build break) — commit 54a6f98b
    Option B: dropped freebsd from monoclock_unix.go's build tag (//go:build linux || darwin), FreeBSD now falls into the existing monoclock_fallback.go using Go's runtime monotonic clock. GOOS=freebsd GOARCH=amd64 go build ./... succeeds.

  • Item 2 (P1, clock failure must fail-closed) — commit 03edaa16
    Kept the (ns int64) return (no signature change) and extended the fail-closed contract to every consumer: leaseState.valid and engineLeaseAckValid both reject a zero monoclock.Instant explicitly, so callers sampling monoclock.Zero (clock_gettime denied under seccomp) fall through to LinearizableRead. Regression tests in kv/lease_state_test.go and kv/lease_read_test.go pin the invariant.

  • Item 3 (P3, lease rollback CAS needs unique token) — commit 587717eb
    Replaced the atomic.Int64 + atomic.Uint64 pair with atomic.Pointer[leaseSlot] where each leaseSlot wraps {expiryNanos, gen}. Each successful extend installs a freshly-allocated slot, so rollback is pointer-identity CAS on the extender's own slot — a concurrent extender that captured a post-invalidate generation and computed the same expiry (clock-granularity tie) cannot be clobbered because its slot is a distinct allocation. New TestLeaseState_RollbackCASUsesPointerIdentity deterministically simulates the clock-tie race.

Checks

  • GOOS=freebsd GOARCH=amd64 go build ./... — clean
  • go test -race ./kv/... ./internal/monoclock/... ./internal/raftengine/... — all green
  • go vet ./... — clean
  • golangci-lint run ./kv/... ./internal/monoclock/... — 0 issues

/gemini review
@codex review

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Chef's kiss.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
internal/raftengine/etcd/quorum_ack.go (1)

59-71: ⚠️ Potential issue | 🟡 Minor

Minor: failed clock_gettime samples silently get buried by the descending sort.

monoclock.Now().Nanos() returns 0 on ClockGettime failure. That 0 gets stored into peerAcks[peerID], and in recomputeLocked the descending sort sinks it to the bottom — so a single peer whose ack was sampled during a seccomp/sandbox clock failure effectively disappears from the quorum computation (another real ack gets promoted into the boundary slot). The tracker reports a satisfied quorum that is one peer short of what the accounting suggests.

This is fail-open in exactly the scenario the rest of the PR is fail-closed about. It's a narrow window (syscall failure on the leader between the ack and recordAck), but cheap to make consistent: either skip the update when now == 0, or filter zeros out of ackBuf before sorting.

🛡️ Proposed fix (option A: skip store on failed sample)
 func (t *quorumAckTracker) recordAck(peerID uint64, followerQuorum int) {
 	if followerQuorum <= 0 {
 		return
 	}
 	now := monoclock.Now().Nanos()
+	if now == 0 {
+		// clock_gettime failed; do not poison peerAcks with a
+		// sentinel that would be silently buried by the sort.
+		return
+	}
 	t.mu.Lock()

Also applies to: 108-126

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/raftengine/etcd/quorum_ack.go` around lines 59 - 71, The issue is
that monoclock.Now().Nanos() can return 0 on failure and that zero timestamp is
stored into quorumAckTracker.peerAcks, which then gets sorted down in
recomputeLocked and can falsely satisfy quorum; change
quorumAckTracker.recordAck to ignore failed clock samples by skipping the map
update when now == 0 (i.e., do not set t.peerAcks[peerID] when the sampled
timestamp is zero), and also add a defensive filter in recomputeLocked to ignore
zero timestamps when building ackBuf so any existing zeros cannot affect the
descending sort; refer to quorumAckTracker.recordAck,
quorumAckTracker.recomputeLocked, and the peerAcks map when making these
changes.
kv/coordinator.go (1)

285-306: ⚠️ Potential issue | 🟠 Major

Don’t warm the lease from a zero clock sample.

Line 306 and Line 458 both extend the caller-side lease from dispatchStart.Add(...) / now.Add(...). If monoclock.Now() failed and returned monoclock.Zero, Instant.Add still produces a positive counter, so the lease can become briefly valid instead of failing closed on clock-read failure.

Proposed fix
 func (c *Coordinate) refreshLeaseAfterDispatch(resp *CoordinateResponse, err error, dispatchStart monoclock.Instant, expectedGen uint64) {
 	if err != nil {
 		// Only invalidate on errors that actually signal leadership
@@
 	lp, ok := c.engine.(raftengine.LeaseProvider)
 	if !ok {
 		return
 	}
+	if dispatchStart.IsZero() {
+		return
+	}
 	c.lease.extend(dispatchStart.Add(lp.LeaseDuration()), expectedGen)
 }
@@
 	idx, err := c.LinearizableRead(ctx)
 	if err != nil {
 		if isLeadershipLossError(err) {
 			c.lease.invalidate()
 		}
 		return 0, err
 	}
-	c.lease.extend(now.Add(leaseDur), expectedGen)
+	if !now.IsZero() {
+		c.lease.extend(now.Add(leaseDur), expectedGen)
+	}
 	return idx, nil
 }

Also applies to: 428-459

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv/coordinator.go` around lines 285 - 306, The code extends the caller-side
lease using a clock sample that may be monoclock.Zero (e.g., in
refreshLeaseAfterDispatch and the corresponding read-path that uses
now.Add(...)), which can erroneously warm a lease when clock reading failed;
update both places (the call site that does dispatchStart.Add(...) in
refreshLeaseAfterDispatch and the other site that does now.Add(...)) to first
detect a zero/failed clock sample (compare the Instant to monoclock.Zero or use
its IsZero helper) and, if zero, avoid extending the lease (preferably
invalidate or skip extend so we fail closed) instead of calling Instant.Add;
keep the existing logic for non-zero Instants.
🧹 Nitpick comments (1)
internal/monoclock/monoclock_test.go (1)

18-30: Consider dropping the time.Sleep and the b == a branch.

time.Sleep(100µs) on several CI schedulers (Windows/macOS lower-resolution timers, loaded Linux runners) routinely returns after a full tick, which is fine for this test — but the assertion require.False(t, b.Before(a)) combined with b.After(a) || b == a is equivalent to just require.False(t, b.Before(a)). Simplifying makes the intent (monotonic non-regression) clearer without relying on struct-equality of an unexported-field type.

♻️ Suggested simplification
 	a := Now()
 	require.False(t, a.IsZero(), "Now must return non-zero instant on supported platforms")
-	time.Sleep(100 * time.Microsecond)
 	b := Now()
 	require.False(t, b.Before(a), "monotonic-raw clock must not regress across calls")
-	require.True(t, b.After(a) || b == a)

Also drops the now-unused time import.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/monoclock/monoclock_test.go` around lines 18 - 30, Remove the sleep
and the redundant equality check in TestNow_IsNonZeroAndMonotonic: keep the call
to Now() (a := Now(), b := Now()), assert a.IsZero() is false and replace the
final two assertions with a single monotonic non-regression check using
require.False(t, b.Before(a)). Update imports to drop unused time. This touches
the TestNow_IsNonZeroAndMonotonic test and references Now(), IsZero(), Before(),
and After() in the assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/design/2026_04_20_implemented_lease_read.md`:
- Around line 92-125: The design doc still describes the old atomic layout
(leaseState with gen/expiryNanos) and caller-only pseudocode, but the shipped
implementation uses atomic.Pointer[leaseSlot] and the primary LastQuorumAck()
fast path; update Sections 3.1 and 3.5 (and lines referenced 272-304) to match
the implementation by replacing the gen/expiryNanos description with the
leaseSlot/atomic.Pointer design, document how invalidate(), extend(), and
LeaseRead interact with the atomic.Pointer-based slot and how LastQuorumAck()
provides the fast path, and update the pseudocode to reflect the
coordinator-side behavior in lease_state.go and coordinator.go (including how
CAS on leaseSlot and generation/versioning is performed and how fallbacks to
LinearizableRead occur).

In `@internal/raftengine/engine.go`:
- Around line 99-125: The LeaseProvider contract comment for LastQuorumAck()
must be updated to match the actual fail-closed checks used by
engineLeaseAckValid: include the additional guard `!now.IsZero()` in the safety
formula and explicitly state that callers must sample `now := monoclock.Now()`
and check `!now.IsZero()`; also update the platform note to remove the absolute
claim that FreeBSD uses CLOCK_MONOTONIC_RAW and instead document that FreeBSD
may fall back (or refer to internal/monoclock behavior) so implementers don't
assume RAW is always available. Ensure the comment references LastQuorumAck(),
engineLeaseAckValid, monoclock.Now(), and LeaseDuration() so readers can locate
the related logic.

---

Outside diff comments:
In `@internal/raftengine/etcd/quorum_ack.go`:
- Around line 59-71: The issue is that monoclock.Now().Nanos() can return 0 on
failure and that zero timestamp is stored into quorumAckTracker.peerAcks, which
then gets sorted down in recomputeLocked and can falsely satisfy quorum; change
quorumAckTracker.recordAck to ignore failed clock samples by skipping the map
update when now == 0 (i.e., do not set t.peerAcks[peerID] when the sampled
timestamp is zero), and also add a defensive filter in recomputeLocked to ignore
zero timestamps when building ackBuf so any existing zeros cannot affect the
descending sort; refer to quorumAckTracker.recordAck,
quorumAckTracker.recomputeLocked, and the peerAcks map when making these
changes.

In `@kv/coordinator.go`:
- Around line 285-306: The code extends the caller-side lease using a clock
sample that may be monoclock.Zero (e.g., in refreshLeaseAfterDispatch and the
corresponding read-path that uses now.Add(...)), which can erroneously warm a
lease when clock reading failed; update both places (the call site that does
dispatchStart.Add(...) in refreshLeaseAfterDispatch and the other site that does
now.Add(...)) to first detect a zero/failed clock sample (compare the Instant to
monoclock.Zero or use its IsZero helper) and, if zero, avoid extending the lease
(preferably invalidate or skip extend so we fail closed) instead of calling
Instant.Add; keep the existing logic for non-zero Instants.

---

Nitpick comments:
In `@internal/monoclock/monoclock_test.go`:
- Around line 18-30: Remove the sleep and the redundant equality check in
TestNow_IsNonZeroAndMonotonic: keep the call to Now() (a := Now(), b := Now()),
assert a.IsZero() is false and replace the final two assertions with a single
monotonic non-regression check using require.False(t, b.Before(a)). Update
imports to drop unused time. This touches the TestNow_IsNonZeroAndMonotonic test
and references Now(), IsZero(), Before(), and After() in the assertions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 290b4a60-fb1e-433a-8e33-e76dae53d7c1

📥 Commits

Reviewing files that changed from the base of the PR and between 5b4af15 and 587717e.

📒 Files selected for processing (16)
  • docs/design/2026_04_20_implemented_lease_read.md
  • internal/monoclock/monoclock.go
  • internal/monoclock/monoclock_fallback.go
  • internal/monoclock/monoclock_test.go
  • internal/monoclock/monoclock_unix.go
  • internal/raftengine/engine.go
  • internal/raftengine/etcd/engine.go
  • internal/raftengine/etcd/quorum_ack.go
  • internal/raftengine/etcd/quorum_ack_test.go
  • kv/coordinator.go
  • kv/lease_read_test.go
  • kv/lease_state.go
  • kv/lease_state_test.go
  • kv/raft_engine.go
  • kv/sharded_coordinator.go
  • kv/sharded_lease_test.go

Comment on lines 92 to +125
```go
type leaseState struct {
gen atomic.Uint64 // bumped by invalidate()
expiry atomic.Pointer[time.Time] // nil = expired / invalidated
gen atomic.Uint64 // bumped by invalidate()
expiryNanos atomic.Int64 // 0 = expired / invalidated; else monoclock.Instant nanos
}
```

- `expiry == nil` or `time.Now() >= *expiry`: lease is expired. The next
`LeaseRead` falls back to `LinearizableRead` and refreshes the lease on
success.
- `time.Now() < *expiry`: lease is valid. `LeaseRead` returns immediately
without contacting the Raft layer.
- `invalidate()` increments `gen` before clearing `expiry`. `extend()`
captures `gen` at entry and, after its CAS lands, undoes its own
write (via CAS on the pointer it stored) iff `gen` has moved. This
prevents a Dispatch that succeeded just before a leader-loss
invalidate from resurrecting the lease milliseconds after it was
cleared. A fresh `extend()` that captured the post-invalidate
generation is left intact because it stored a different pointer.

The lock-free form lets readers do one atomic load + one wall-clock compare
on the fast path.
All timestamps on the lease path come from `internal/monoclock`, which
reads `CLOCK_MONOTONIC_RAW` via `clock_gettime(3)` on Linux and Darwin
(falling back to Go's runtime monotonic on other platforms — FreeBSD
included, since `golang.org/x/sys/unix` does not export
`CLOCK_MONOTONIC_RAW` on FreeBSD).
The raw monotonic clock is immune to NTP rate adjustment and wall-clock
step events — TiKV's lease path makes the same choice. Go's
`time.Now()` is not sufficient: its embedded monotonic component is
still NTP-slewed at up to 500 ppm under POSIX, and a misconfigured or
abused time daemon can exceed that cap. See §3.2 on why the safety
argument should not rest on NTP behaving.

- `expiryNanos == 0` or `monoclock.Now() >= expiry`: lease is expired.
The next `LeaseRead` falls back to `LinearizableRead` and refreshes
the lease on success.
- `monoclock.Now() < expiry`: lease is valid. `LeaseRead` returns
immediately without contacting the Raft layer.
- `invalidate()` increments `gen` before clearing `expiryNanos`.
`extend()` captures `gen` at entry and, after its CAS lands, undoes
its own write (via CAS on the exact value it wrote) iff `gen` has
moved. This prevents a Dispatch that succeeded just before a
leader-loss invalidate from resurrecting the lease milliseconds
after it was cleared. A fresh `extend()` that captured the
post-invalidate generation is left intact because its CAS already
replaced the earlier target.

The lock-free form lets readers do one atomic load + one monotonic-raw
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Update the design doc to describe the implementation that ships in this PR.

Section 3.1 still shows the old gen/expiryNanos atomic layout, and the Section 3.5 pseudocode still describes only the caller-side lease path. The code in kv/lease_state.go and kv/coordinator.go now uses atomic.Pointer[leaseSlot] plus the primary LastQuorumAck() fast path, so the document is already out of sync.

Also applies to: 272-304

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/2026_04_20_implemented_lease_read.md` around lines 92 - 125, The
design doc still describes the old atomic layout (leaseState with
gen/expiryNanos) and caller-only pseudocode, but the shipped implementation uses
atomic.Pointer[leaseSlot] and the primary LastQuorumAck() fast path; update
Sections 3.1 and 3.5 (and lines referenced 272-304) to match the implementation
by replacing the gen/expiryNanos description with the leaseSlot/atomic.Pointer
design, document how invalidate(), extend(), and LeaseRead interact with the
atomic.Pointer-based slot and how LastQuorumAck() provides the fast path, and
update the pseudocode to reflect the coordinator-side behavior in lease_state.go
and coordinator.go (including how CAS on leaseSlot and generation/versioning is
performed and how fallbacks to LinearizableRead occur).

Comment thread internal/raftengine/engine.go
…ed guards

Two staleness bugs in the interface docstring:

1. The safety formula omitted !now.IsZero(), which engineLeaseAckValid
   enforces after the fail-closed fix. Implementers copying the
   contract comment would miss the guard that prevents a seccomp /
   clock_gettime failure from resurrecting a warmed lease.
2. The platform note listed FreeBSD as a CLOCK_MONOTONIC_RAW target,
   but golang.org/x/sys/unix does not export the constant there and
   internal/monoclock routes FreeBSD through the runtime-monotonic
   fallback.

Update both so the interface docs match the actual behaviour.
@bootjp bootjp enabled auto-merge April 23, 2026 19:29
@bootjp bootjp merged commit eda7253 into main Apr 23, 2026
8 checks passed
@bootjp bootjp deleted the feat/monoclock branch April 23, 2026 19:34
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c5c27b7b19

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// persistent syscall failure cannot leave a warmed lease valid.
func nowNanos() int64 {
var ts unix.Timespec
if err := unix.ClockGettime(unix.CLOCK_MONOTONIC_RAW, &ts); err != nil {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Use a suspend-aware clock for lease expiration

Selecting CLOCK_MONOTONIC_RAW in internal/monoclock/monoclock_unix.go makes lease age checks depend on a clock that does not advance during system suspend, but kv/engineLeaseAckValid and lease checks use now.Sub(ack) < leaseDur as if elapsed real time always advances. In a VM/host pause or cgroup freezer event, a former leader can resume with StateLeader and a still-“fresh” lease even though another node may have elected and committed writes during the pause, allowing stale reads until step-down traffic is processed. This should use a suspend-aware source (or explicitly invalidate leases on resume) to preserve lease safety across suspend/resume.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant